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Abstract 

Computational neuroscience is being revolutionized with the advent of multi-electtode arrays that provide real-time, 
dynamic, perspectives into brain function. Mining event streams from these chips is critical to understanding the firing 
patterns of neurons and to gaining insight into the underlying cellular activity. We present a GPGPU solution to mining 
spike ttains. We focus on mining frequent episodes which captures coordinated events across time even in the presence of 
intervening background/"junk" events. Our algorithmic contributions are two-fold: MapConcatenate, a new computation- 
to-core mapping scheme, and a two-pass elimination approach to quickly find supported episodes from a large number 
of candidates. Together, they help realize a real-time "chip-on-chip" solution to neuroscience data mining, where one 
chip (the multi-electrode array) supplies the spike train data and another (the GPGPU) mines it at a scale unachievable 
previously. Evaluation on both synthetic and real datasets demonstrate the potential of our approach. 

Keywords: Frequent episode mining, GPGPU, spike train datasets, computational neuroscience. 

1 Introduction 

Temporal (symbolic) event streams are popular in scientific domains, such as physical plants, medical diagnostics, and 
neuroscience. In all these domains, we are given occurrences of events of interest over a time course and the goal is to 
identify trends and behaviors that serve discriminatory or descriptive purposes. In this paper, we focus exclusively on event 
streams gathered through multi-electrode array (MEA) chips for studying neuronal function although our algorithms and 
implementations are applicable to a wider variety of domains. 

An MEA records spiking action potentials from an ensemble of neurons which after various pre-processing steps, yields 
a spike train dataset providing real-lime, dynamic, perspectives into brain function (see Figure 1). Identifying sequences 
(e.g., cascades) of firing neurons, determining their characteristic delays, and reconstructing the functional connectivity 
of neuronal circuits are key problems of interest. This provides critical insights into the cellular activity recorded in the 
neuronal tissue. 
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Figure 1: (left) Chip-on-Chip neuroscience: spike trains recorded from a multi-electrode array (MEA) are mined by a 
GPGPU to yield frequent episodes which can be summarized to reconstruct the underlying neuronal circuitry, (right) 
Architecture of the NVIDIA GTX280 GPGPU in detail. 

With rapid developments in instrumentation and data acquisition technology, the size of event streams recorded by 
MEAs has concomitantly grown, leading us to exhaust the abilities of sequential computation. For instance, just a few 
minutes of cortical recording from a 64-channel MEA can easily generate millions of spikes! It has thus become imperative 
to enable fast, near real-time computation, data mining, and visuaUzation of mined patterns. This mapping is critical as it 
can result in orders-of-magnitude difference in performance. 

We adopt GPGPUs as the platform of choice for neuroscience data mining for several reasons. First, they are partic- 
ularly economical in morphing a desktop into a rather powerful parallel computing machine (price-to-peiformance ratio). 
This is especially attractive to neuroscientists who might not have access to clusters of workstations. At the same time, 
mapping our application to a GPGPU architecture is non-trivial. GPGPUs were originally designed for data-parallel ap- 
pUcations (e.g., rendering), not quite the class of problems that temporal data mining falls under. Our main contributions 
are as follows: 

1. MapConcatenate, a computation-to-core mapping scheme suitable for frequent episode mining that takes advantage 
of the computational power of hundreds of processing cores on a GPU. MapConcatenate is a two-stage scheme like 
MapReduce [1] but quite different in the intent and appUcabiUty of both the stages. 

2. A two-pass elimination approach to find supported temporal episodes from a large number of candidates. The 
first pass, which is conducted with relaxed timing constraints, can eliminate most of the non-supported episode 
candidates with the accurate count computed using the second pass. Design of this elimination approach and proving 
its correctness is non-trivial since the set of episodes returned by the relaxed set of constraints is not a superset of 
the original set of constraints but the counts are indeed an upper bound. 

3. A 'chip-on-chip' real-time solution to neuroscience data mining where one chip (the MEA) provides the data and 
another chip (the graphics processor) mines it. Our solution is not a complete data streaming solution; nevertheless, 
we achieve real-time responsiveness by processing partitions of the data stream in turn. 
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2 Problem Statement 



A spike train dataset can be modeled as an event stream, where each symbol/event type corresponds to a specific neuron 
(or clump of neurons) and the dataset encodes occurrence times of these events over the time course. 

Definition 2.1. An event stream is denoted by a sequence of events {{Ei, ti), {E2, (-2): ■ ■ ■ (En, tn)), where n is the total 
number of events. Each event [Ei, ti) is characterized by an event type Ei and a time of occurrence ti. The sequence is 
ordered by time i.e. ti < tj+iVi = I, ... ,n — I and Ei 's are drawn from a finite set ^. 

One class of interesting patterns that we wish to discover are frequently occurring groups of events (i.e., firing cascades 
of neurons) with some constraints on ordering and timing of these events. This is captured by the notion of episodes, the 
original framework for which was proposed by Mannila et al [9]. 

Definition 2.2. An (serial) episode a is an ordered tuple of event types Va C ^. 

For example (A ^ B ^ C ^ D) is a 4-node serial episode, and it captures the pattern that neuron A fires, followed 
by neurons B, C and D in that order, but not necessarily without intervening 'junk' firings of neurons (even possibly 
neurons A, B, C, or D). This ability to intersperse noise or don't care states, of arbitrary length, between the event symbols 
in the definition of an episode is what makes these patterns non-trivial, useful, and comprehensible. 

Frequency of episodes: The notion of frequency of an episode can be defined in several ways. In [9], it is defined as the 
fraction of windows in which the episode occurs. Another measure of frequency is the non-overlapped count which is the 
size of the largest set of non-overlapped occurrences of an episode. Two occurrences are non-overlapped if no event of one 
occurrence appears in between the events of the other. In the event stream of Fig. 2, there are at most two non-overlapped 
occurrences of the episode A^ B, although there are 8 occurrences in total. 

We use the non-overlapped occurrence count as the frequency measure of choice due to its strong theoretical properties 
under a generative model of events [7]. It has also been argued in [10] that, for the neuroscience application, counting 
non-overlapped occurrences is natural because episodes then correspond to causative, "syn-fire", chains that happen at 
different times again and again. 

Temporal constraints: Besides the frequency threshold, a further level of constraint can be imposed on the definition 
of episodes. In multi-neuronal datasets, if one would hke to infer that neuron A's firings cause a neuron B to fire, then 
spikes from neuron B cannot occur immediately or spontaneously after A's spikes due to axonal conduction delays. These 
spikes cannot also occur too much later than A for the same reason. Such minimum and maximum inter-event delays are 
conmion in other application domains as well. We hence place inter-event time constraints giving rise to episodes such as: 

In a given occurrence of episode A B C IsHa, ts, and tc denote the time of occurrence of corresponding event 
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types. A valid occurrence of the serial episode satisfies 



tlow < (*B - tA) < t\igh,t^ow < (*C - is) < thigh 

(In general, an A'^-node serial episode is associated with N — 1 inter-event constraints.) In Fig. 2, there is exactly one 
occurrence of the episode A ^^-^^ B ^^^^^ C satisfying the desired inter-event constraints (shown in dotted boxes). 

Problem 1: Given an event stream { {Ei ,ti)},i G {1 ... n}, a set of inter-event constraints / = { (tf^^ , i^jg/i] S 
{1 . . . I}, find all serial episodes a of the form: 

^ low^ hzghi \^low ' high i 

" = (-^(1) '-^(2) • • • " ^{N)) 

such that the non-overlapped occurrence counts of each episode a is > 6, a user-specified threshold. Here 
E(^ys are the event types in the episode a and (ij^^^^, , t\^jgi^] 's e / are the corresponding inter-event constraints. 



3 Prior Work 

We review prior work in three categories: mining frequent episodes, data mining using GPGPUs, and the map-reduce 
framework for large scale computations. 

Mining Frequent Episodes: The overall mining procedure for frequent episodes is based on level-wise mining. Within 
this framework there are two broad classes of algorithms: window-based [9] and state machine based [6, 7], and they 
primarily differ in how they define the frequency of an episode. The window based algorithms define frequency of an 
episode as the fraction of windows on the event sequence in which the episode occurs. The state machine based algorithms 
are more efficient and define frequency as the size of largest set of non-overlapped occurrences of an episode. Within the 
class of state machine algorithms, serial episode discovery using non-overlapped counts was described in [7], and their 
extension to temporal constraints is given in [10]. With the introduction of temporal constraints the state machine based 
algorithms become more complicated. They must keep track of what part of an episode has been seen, which event is 
expected next and, when episodes inter-leave, they must make a decision of which events to be used in the formation of an 
episode. 

Data Mining Using GPGPUs: Many researchers have harnessed the capabilities of GPGPUs for data mining. The key to 
porting a data mining algorithm onto a GPGPU is to, in one sense, "dumb it down"; i.e., conditionals, program branches, 
and complex decision constraints are not easily parallelizable on a GPGPU and algorithms using these constraints will 
require significant reworking to fit this architecture (temporal episode mining unfortunately falls in this category). There 
are many emerging publications in this area but due to space restrictions, we survey only a few here. The SIGKDD tutorial 
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by Guha et al. [4] provides a gentle introduction to the aspects of data mining on GPGPUs through problems hke k-means 
clustering, reverse nearest neighbor(RNN), discrete wavelet transform(DWT), sorting networks, etc. In [2], a bitmap 
technique is proposed to support counting and is used to design GPGPU variants of Apriori and k-means clustering. This 
work also proposes co-processing for itemset mining where the complicated tie data structure is kept and updated in the 
main memory of CPU and only the itemset counting is executed in parallel on GPU. A sorting algorithm on GPGPUs 
with applications to frequency counting and histogram construction is discussed in [3] which essentially recreates sorting 
networks on the GPU. Li et al. [8] present a 'cut-and-stitch' algorithm for approximate learning of Kalman filters. Although 
this is not a GPGPU solution per se, we point out that our proposed approach shares with this work the difficulties of mining 
temporal behavior in a parallel context. 

MapReduce: Modeled after LISP primitives, MapReduce [1] provides a distribution framework for large scale compu- 
tations using two fimctions: map, and reduce. It has received a fair share of attention (and, sometimes, criticism) from 
the cloud computing and data-intensive computing communities. The framework has been ported to many platforms, 
including GPGPUs (e.g., see [5]). See Section 5.2.2 for comparisons between our proposed framework and MapReduce. 



4 GPGPU Architecture 

To understand the algorithmic details behind our approach, we provide a brief overview of GPGPU and its architecture. 

The initial purpose of specialized GPUs was to accelerate the display of 2D and 3D graphics, much in the same 
way that the FPU focused on accelerating floating-point instructions. However, the rapid technological advances of the 
GPU, coupled with extraordinary speed-ups of application "toy" benchmarks on the speciaUzed GPUs, led GPU vendors 
to transform the GPU from a speciaUzed processor to a general-purpose graphics processing unit (GPGPU), such as 
the NVIDIA GTX 280, as shown in Figure 1 (right). To lessen the steep learning curve, GPU vendors also introduced 
programming environments, such as the Compute Unified Device Architecture (CUD A). 

Processing Elements: The basic execution unit on the GTX 280 is a scalar processing core, of which 8 together form 
a multiprocessor. While the number of multiprocessors and processor clock frequency depends on the make and model 
of the GPU, every multiprocessor in CUDA executes in SIMT (Single Instruction, Multiple Thread) fashion, which is 
similar in nature to SIMD (Single Instruction, Multiple Data) execution. Specifically, groups of 32 threads form a warp 
and execute the exact same codepath until every thread terminates. However, when codepaths diverge, each thread must 
now execute every instruction on every thread path, therefore, implying that optimal performance is attained when all 32 
threads do not branch down different codepaths. The execution of a single instruction with 8 cores and a warp size of 32 
completes in 4 cycles. 

Memory Hierarchy: The GTX 280 contains multiple forms of memory. Beginning with the furthest from the GPU 
processing elements, the device memory is located off-chip on the graphics card and provides the main source of storage 
for the GPU while simultaneously being accessible from the CPU and GPU. Each multiprocessor on the GPU contains 
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three caches — a texture cache, constant cache, and shared memory. The texture cache and constant cache are both 
read-only memory providing fast access to immutable data. Shared memory, on the other hand, is read-write to provide 
each core with fast access to the shared address space of a thread block within a multiprocessor. Finally, on each core 
resides a plethora of registers such that there exists minimal reliance on local memory resident off-chip on the device 
memory. 

Parallelism Abstractions: At the highest level, the CUDA programming model requires the programmer to offload 
functionaUty to the GPGPU as a compute kernel. This kernel is evaluated as a set of thread blocks logically arranged 
in a grid to facilitate organization. In tum, each block contains a set of threads, which will be executed on the same 
multiprocessor, with the threads scheduled in warps, as mentioned above. 

5 Algorithms 

Our overall approach for solving Problem 1 is based on a state machine algorithm with inter-event constraints [10]. There 
are two major phases of this algorithm: generating episode candidates and counting these episodes, and we focus on 
the latter since it is the key performance bottleneck, typically by a few orders of magnitude. Therefore, we present a 
GPU-accelerated algorithm for counting episodes of various size, while candidate generation is executed sequentially on 
a CPU. 

In this section, we first present the standard sequential algorithm for mining frequent episodes with inter-event con- 
straints. Next, we introduce our GPU based algorithm (called Al). We then propose a two-pass counting approach, where 
the first counting pass eliminates most of unsupported episodes and the second coimting pass completes the counting tasks 
for the remaining episodes. Since the first counting pass uses a less complex algorithm (called A2), the execution time 
saved at this step contributes to an overall performance gain even we go through two-pass coimting {A2 -\- Al). 

5.1 Episode Mining with Inter-Event Constraints 

Algorithm 1 outlines the serial counting procedure for a single episode a. Briefly, it maintains a data structure s which is 
a Ust of Usts. Each Ust s[k\ in s corresponds to an event type E(^]^) € a and stores the times of occurrences of those events 
with event-type E^^j.) which satisfy the inter-event constraint {t^u>w^\'t%igh'] \Qd&\. one entry tj e s[k — 1]. This 

requirement is relaxed for s[0] , thus every time an event £J(o) is seen in the data its occurrence time is pushed into s[0] . 

When an event of type -B(fe),2 < k < N at time t is seen, we look for an entry tj G s[k — 1] such that t — 
^3 ^ (^loiu ^Wg?!^]- Therefore, if we are able to add the event to the Ust s[k], it impUes that there exists at least one 
previous event with event-type -E(fe_i) in the data stream for the current event which satisfies the inter-event constraint 
between and Applying this argument recursively, if we are able to add an event with event-type to its 

corresponding Ust in s, then there exists a sequence of events corresponding to each event type in a satisfying the respective 
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inter-event constraints. Such an event marks the end of an occurrence after which the count for a is incremented and the 
data structure s is reinitialized. Figure 2 illustrates the data structure s for counting A ^^-i"' B '^^^ij^' C. 

The maximality of the left-most and inner- most occurrence counts for a general serial episode has been shown in [7]. 
Similar arguments hold for the case of serial episodes with inter-event constraints and not shown here for lack of space. 
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Figure 2: Illustration of the data structure s for counting A B C 



Algorithm 1 Serial Episode Mining 

Input: Candidate A^-node episode a = (-E(i) °- — ^-'^ ' ... i?(Ar)) and event sequence S = U)},i ^ {I . . . n}. 
Output: Count of non-overlapped occurrences of a satisfying inter-event constraints 
count — 0;s— [[],...,[]] //List of \a\ Usts 
for all {E, <) e S do 
for i ~ \a\ to 1 do 

— i*'' event type e a 
5: if £; = E(^i-) then 

6: Iprev — ^ 1 

7: if I > 1 then 

8: k — \s[iprev] I 

while fc > do 

iprev — ^{^prev: ^] 
" ''low ^ '■P'-ei' — '■high 

if z = |q:| — 1 then 

count + +; s = [[],..., []]; break Line: 3 
else 

s[i] = s[i] U t 
break Line: 9 
fc = fc- 1 

else 

s[i\ = s[i] Ut 
RETURN count 



5.2 GPU Mining Algorithm: Al 
5.2.1 PTPE: per-thread per-episode 

In order to design a parallely-executed algorithm for counting supports for AI episodes, we need to identify a set of 
independent computational tasks/units that can be mapped onto different computational cores of GPU. A standard approach 
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Figure 3: Illustration of splitting a data sequence into segments and counting within each segment. 

is to use M computational units, each of which is counting for one episode. We then map each computational unit to a logic 
GPU thread, which can execute in parallel with other threads. Because the counting of one episode in each computational 
unit is executed sequentially, like CPU, we simply implement Algorithm 1 as the CUDA kernel for each counting thread. 

5.2.2 MapConcatenate: multiple threads per episode 

When the number of episodes is smaller than the number of cores, a per-thread per-episode model is prone to inefficiencies 
in core usage , leading to higher execution times. Hence we seek to achieve a higher level of parallelism within the counting 
of a single episode. We segment the input event stream into segments of sub-streams and use each thread block to count one 
segment. The subcounts of all segements need to be concatenated together for the final count, and hence we refer to this 
two-step approach as "MapConcatenate" , which include a Map step, where each event segment is mapped to one thread 
block for local counting, and a Concatenate step, where the counting results from neighboring segments coresponding to 
episodes that straddle them are concatenated/merged together. 
MapReduce vs MapConcatenate 

This new computation-to-core mapping scheme, MapConcatenate , is quite different from MapReduce [1], in both 
stages. Whereas the map functions in MapReduce operate independently across segments, the map functions in Map- 
Concatenate require data access from adjacent segments. Similarly, whereas the results from any two segments can be 
"reduced" in MapReduce, only the results from two adjacent segments can be concatenated in MapConcatenate. 

Design of MapConcatenate 

When we divide the input stream into segments, there are chances that some occurrences of an episode span across the 
boundaries of consecutive segments. As an example, see Fig. 3 which depicts a data sequence divided into two segments. 
The shaded rectangles on the top mark the non-overlapped occurrences hi ... h4 of an episode {A ^ B ^ C) (assume 
for now that inter-event constraints are always satisfied), as seen by a state machine ao on the unified event sequence, ag 
is thus the reference (serial) state machine. Let ai and a2 be the state machines counting occurrences in segment 1 and 
segment 2 respectively. During the Map step, ai and a2 are executed in parallel, and each state machine can see a local 
view of the episode occurrences, which are shown by empty rectangles below the event sequence, ai sees the occurrence 
gi and a partial occurrence (72- For a2, it will first see the occurrence and therefore miss before moving onto 34. 

We propose Map and Concatenate steps that use multiple state machines in each segment, so that the counting of a 
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segment is able to anticipate partial occurrences near boundaries. Let us explain in detail why multiple state machines are 
necessary, and how we design the Map step and Concatenate step to maintain the correctness of counting. 

>E(N) 




Possible locations of split 

Figure 4: Illustration of different possibilities of an occurrence splitting across two adjacent data segments. 

Assume that we are counting an episode a = (-E(i) °- — >" ■ ■ ■ -B(jv))> the data sequence is divided into P segments, 
and events in the p*^ data segment are in the range (xp, Tp_|_i]. An occurrence of episode a can be split across two adjacent 
segments in at least N ways as shown in Figure 4. For each possible split, we need one state machine, ap,0 < k < A^— l,to 
count the second segment, starting att — Tp — J2i=i thigh- have N different state machines all counting occurrences 

of episode alpha using Algorithm 1, handling all possible cases of split between current segment and previous segment. 

I I 

Event sequence I Segmcnt-p ^P+l 



count^j 



countt 



count'. 



'p+1 ^ l^i^l '-high 



Figure 5: Illustration of a Map step. 

For each segment p, the Map step is designed as follows, and illustrated in Figure 5. 

1. Each state machine maintains its own count = county. 

2. ap does not increment count for occurrences ending at time t < Tp. 

3. Op stores the end time of the first occurrence that completes at time t, Tp < t < Tp + ^l^igh- ^^^^ '^p- 
If there is no such occurrence is set to Tp. 

4. ttp on reaching end of the segment, crosses over into the next segment to completes the current partial occurrence 
and continues until t < Tp+i + J2iLi^ t^high- '■^^ ^^'^ time, of this occurrence be hp. Note that count is not 
incremented for this occurrence. In case the occurrence cannot be completed hp is set to Tp+i. 

The result of the Map step for each segment p is tuples of {Up, count p, bp). Based on these results, we design our 
Concatenate step for pairs of consecutive segments as follows, and illustrated in Figure 6. 



1. Start Concatenate step at level 1. 
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[a\, countl + count^, 62) 
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{a\, countl + count\, 64) 
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Level 3* ^^i' '^'^'"^^i + count^ + county + countl, ^1) 

Figure 6: Illustration of a Concatenate step. 

2. For level i, concatenate the tuples of segment {j — 1)2' + 1 with the tuples of segment (j — 1)2' + 1 + 2-'^^ for 
all possible j. The procedure for concatenate the tuples of s*'' segment and t*''segment is: find all pairs of tuples 
(a^, count'l, b^l) and {al,count[, b^) such that fe^ — a[, and then concatenate these pairs to obtain the next level 
(i + 1) tuples (flj , count'l + county, b\) for s*''segment. 

3. After all adjacent segment pairs are concatenated for level i, increase the level to i + 1 and repeat the previous step 
until there is only one segment left for this level. 

It is worth mentioning that, at level i of Concatenate step, segment (j — 1)2* + 1 and segment {j — 1)2' + 1 + 2^~^ are 
considered as adjacent segments. We also choose segment number p to be a power of 2, say 2'^, so that the Concatenate 
step takes exactly q + I levels and 2"^+^ — 1 concatenate operations to finish. 

5.2.3 A Hybrid Algorithm 

MapConcatenate can greatly out-perform the PTPE algorithm in cases when the number of episodes is small and some 
GPU cores are idle for the PTPE algorithm. For other cases, the PTPE algorithm would run much faster than MapCon- 
catenate. So, as a practical approach, we would like to use their selective superiorities to automatically decide the right 
algorithm to execute. The logic for making these choices can be very simple: if the GPU can be fully utilized with the 
PTPE algorithm, then we choose it, else we choose MapConcatenate. In NVIDIA's CUDA framework, the GPU is fully 
utilized when: 

S > MP X Bmp X Tb (1) 

where S is the number of episodes to be counted, Tg is the number of threads per block as defined by the algorithm, MP 
is the number of multiprocessors on the GPGPU, and Bm p is the number of blocks that the compiler determines can be 
fit into one multi-processor 
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Algorithm 2 GPU Mining Algorithm (Al) 
1: itS> MP X Bmp xTbX f{N) then 
2: Call PTPE Algorithm 
3: else 

4: Call MapConcatenate Algorithm 

However, another key factor that determines the selective superiority is the size/length (N) of the episode being 
counted. For example, MapConcatenate will run slower on larger N, since the Map step needs N state machines to 
count each event segment, forcing the Concatenate step to take more time to concatenate the state machines together. The 
performance of the PTPE algorithm is also dependent on N, but will not change as much as MapConcatenate, since it 
only uses one state machine to count each episode. Therefore, we need to consider the effect of N when deciding which 
algorithm to use: 

S > MP X Bmp x Tb x f{N) (2) 

where f{N) is a performance penalty factor dependent on the episode length/level of the mining algorithm. We define the 
cross-over point as the number of episodes beyond which we estimate that the PTPE algorithm will outperform MapCon- 
catenate. 

By considering both GPU utilization and episode size, we propose the Hybrid algorithm (Al) for mining temporal 
episodes on the GPU as shown in algorithm 2. 

5.3 A Two-Pass Elimination Approach: A2+A1 

After a thorough analysis of GPU performance of algorithm Al, we find the performance is largely Umited by the require- 
ment of large amount of shared memory and large number of GPU registers for each GPU thread. For example, if the 
episode size is 5, each thread requires 220 bytes of shared memory and 97 bytes of register file. It means that only 32 
threads can be allocated on a GPU multi-processor, which has 16K bytes of shared memory and register file. When each 
thread requires more resources, only fewer threads can run on GPU at the same time, resulting in more execution time for 
each thread. 

To address this problem, the only way is to reduce the complexity of the algorithm without losing correctness. In 
this section, we introduce a two-pass elimination approach that more efficiently searches larger numbers of episodes, 
further improving the overall performance. The idea is to use a far less complex algorithm, A2, to eliminate most of 
non-supported episodes, and only use the complex Al algorithm to detemiine if the rest of episode is supported or not. In 
order to introduce algorithm A2, we consider the solution to a sUght relaxed problem, which plays an important role in our 
two-pass elimination approach. 
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5.3.1 Less-Constrained Mining: Algorithm A2 

Let us consider a constrained version of Problem 1 . Instead of enforcing both lower-limits and upper-limits on inter-event 
constraints, we design a counting solution that enforces only upper limits. 

Let a' be an episode with the same event types as in a, where a uses the original episode definition from Problem L 
The lower bounds on the inter-event constraints in a are relaxed for a! as shown below. 

Observation 5.1. In Algorithm 1, if lower-bounds of inter-event constraints in episode a are relaxed as a', the list size of 
s[k], 1 < k < N can be reduced to 1. 



Proof. In Algorithm 1, when an event of type is seen at time t while going down the event sequence, s[Ei^i^_i-^] is 

,(k-i) 

high 



looked up for at least one t\ \ such that t - t\ ^ e (O.^^g^'']- Note that t\ ^ represents the i^^ entry of s[E(^k-i)] 



corresponding the {k — 1)*'' event-type in a. 

Let s[i^(fe_i)] = {t\~^ . . . t^^} and t^~^ be the first entry which satisfies the inter-event constraint (0, t^^^^^], i.e., 

o<i-t^i<ii^;^^) (3) 

Also Equation 4 below follows from the fact that t'l~^ is the first entry in s[£J(/;_i)] matching the time constraint. 

^fe-i ^ ^fe-i < Vi e {i + 1 . . . m} (4) 
From Equation 3 and 4, Equation 5 follows. 

Q<t-t]-'< 4^;^^^ , Vj e {i + 1 . . . m} (5) 

This shows that every entry in s[S(fe_i)] following t'l~^ also satisfies the inter-event constraint. This follows from the 
relaxation of the lower-bound. Therefore it is sufficient to keep only the latest time stamp t'^^ only in s[£^(fe_i)] since it 
can serve the purpose for itself and all entries above/before it, thus reducing s i)] to a single time stamp rather than a 
hst (as in Algorithm 1). □ 



5.3.2 Combined Algorithm: A2+A1 

Now, we can return to the original mining problem (with both upper and lower bounds). By combining Algorithm A2 
with Algorithm Al, we can develop a two-pass elimination approach {A2 -\- Al) that can deal with the cases on which 
Algorithm Al cannot be executed. Algorithm A2 + Al is as follows. 
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Algorithm 3 Less-Constrained Mining: A2 



Input: Candidate episode a. = (£^(i) — * ...£^(jv)) is a AT-node episode, event sequence S = {{Ei,ti)},i G 
{l...n}. 

Output: Count of non-overlapped occurrences of a 
count = 0; s = [] //List of |a| time stamps 
for all (E, t) e S do 
for i = \a\ to 1 do 

= i*^ event type e a 
\iE = Ei^i) then 

if i > 1 then 

ift-s[Ve.]<fe''^then 
if i = |q| then 

count + +; s = []; break Line: 3 
else 

s[i\ = t 

else 

s[i] = t 
Output: count 



Algorithm 4 Combined Algorithm: A2+A1 
1: (First pass) For each episode a, run A2 on its less-constrained counterpart, a' . 
2: Eliminate every episode a, if count{a') < CTh, where CTh is the support count threshold. 
3: (Second Pass) Run Al on each remaining episode, a, with both inter-event constrains enforced. 



Algorithm A2 + Al yields the correct solution for Problem 1. Although the set of episodes mined under the less 
constrained version are not a superset of those mined under the original problem definition, we can show the following 
result: 

Theorem 5.1. count{a') > count{a), i.e., the count obtained from Algorithm A2 is an upper-bound on the count obtained 
from Algorithm Al. 

Proof. Let /i be an occurrence of a. Note that /i is a map from event types in a to events in the data sequence S. Let the 
time stamps for each event type in h be {t*^^' . . . t'^^^}. Since h is an occurrence of a, it follows that 

t\o^ < t^'^ - 1^'-'^ < tligH^i e {1 . . .fc - 1} (6) 

Note that tl^^ > 0. The inequality in Equation 6 still holds after we replace tl^^ with to get Eqn.7. 

0<t«_t(i-i)<ij^.^^,V*G{l...fc-l} (7) 

The above corresponds to the relaxed inter-event constraint in a'. Therefore every occiurence of a is also an occurrence 
of a' but the opposite may not be true. Hence we have that count{a') > count{a). □ 



13 



In our two-pass elimination approach, Algorithm A2 is less complex and runs faster than Algorithm Al, because it 
reduces the time complexity of the inter-event constraint check from 0(|s[i?(fc_i)] |) to 0(1). Therefore, the performance 
of Algorithm A2 + Al is significantly better than Algorithm Al when the number of episodes is very large and the number 
of episodes culled in the first pass is also large as shown by our experimental results described next. 

6 Experimental Results 

6.1 Datasets and Testbed 

6.1.1 Dataset Description 

Our datasets are drawn from both mathematical models of spiking neurons as well as real datasets gathered by Wagenar et 
al. [11] in their analysis of cortical cultures. Both these sources of data are described in detail in [10]. The mathematical 
model involves 26 neurons (event types) whose activity is modeled via inhomogeneous Poisson processes. Each neuron 
has a basal firing rate of 20Hz and two causal chains of connections — one short and one long — are embedded in the 
data. This dataset {Sym26) involves 60 seconds with 50,000 events. The real datasets (2-1-33, 2-1-34, 2-1-35) observe 
dissociated cultures on days 33, 34, and 35 from over five weeks of development. The original goal of this study was to 
characterize bursty behavior of neurons during development. 

6.1.2 Testbed and Runtime Parameters 

We evaluated the performance of our GPU algorithms on a machine equipped with Intel Core 2 Quad 2.33 GHz and 4GB 
system memory. We used a NVIDIA GTX280 GPU, which has 240 processor cores with 1.3 GHz clock for each core, and 
1GB of device memory. 

There are two CUDA runtime parameters we need to determine for each execution on the GPU: number of threads 
per block, T, and the total number of blocks. The second parameter is always calculated based on T so that all required 
computation can be finished with T threads per block and within one CUDA kernel function call. 

Parameter T is determined by the algorithm and the size of the episode (A^). For the PTPE algorithm, we calculate 
the maximum number of threads per block at each N . The larger N is, more shared memory is needed per thread. When 
A'^ = 1, we use 128 threads, and as A'' increases, the maximum number of threads per block decrease due to the shared 
memory limit. When A?" = 6, we cannot have more than 32 threads per block. For MapConcatenate, the event stream is 
segmented into a number {R) of sub-streams, as mentioned in Section 5.2.2. Recall that we need to create multiple threads 
to count all sub-streams independentiy and run multiple state machines, as shown in Figure 5. The number of threads for 
each block T can be calculated asT = R x N, since there are R sub-streams and N state machines. Again, we must 
hmit the number of sub-streams to reduce the number of threads due to the shared memory limit affected by N. For the 
Algorithm A2, we generate as many threads as possible per block until shared memory usage reaches the hardware hmit 
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(16KB). In this case, T is normally much larger than 32, since we do not have a strict memory requirement for the GPU 
for Algorithm A2. 



6.2 PIPE vs MapConcatenate 

Execution Time on Support 100 




Episode Size 

(a) Run Times at Different Episode Sizes. 



Speedup of Hybrid over PIPE and MapConcatenate 




Support Threshold 

(b) Speedups at Different Support Thresliolds. 

Figure 7: Comparison of PTPE, MapConcatenate, and Hybrid on Sym26 dataset. 

Our first set of tests evaluated the three algorithms: PTPE, MapConcatenate, and the hybrid approach (Al). As seen 
in Figure 7(a), it is clear that blindly choosing to execute the PTPE or MapConcatenate approach for all levels is not the 
best solution. For episode sizes of 1, 2, and 5 both approaches complete in roughly the same amount of time. However, 
PTPE significantly outperforms MapConcatenate at episode sizes of 3 and 4 by 3.96X and 2.84X, respectively. On the 
other hand, PTPE performs slower than MapConcatenate for episodes of size 6 (1.32X) and 7 (2.63X). 

These crossover points exist in all of our tests for this dataset (see supplementary information), and for lack of space. 
Table 1 shows the crossover points determined experimentally for the Sym26 dataset. 

Table 1 : Crossover Points on number of episodes below which MapConcatenate should be run (for the fewer episodes 
case). For other episode sizes — 1, 2, and >8 — MapConcatenate should be chosen. 



Level 


3 


4 


5 


6 


7 


8 


Crossover 


415 


190 


200 


100 


100 


60 



Recall Equation 2 in Section 5.2.3 where optimal execution occurs when the GPU is fully utilized and a small factor 
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of episode size is taken into account. Using the table above, with M = 30, Tb = 32, and Bm = 1, we find that f{N) of 
the form + 6 is a better fit than ax N + bas seen in Figure 8. 

Crossover Points for Different Episode Sizes 




Episode Size 

Figure 8: Crossover points fitted to + b and a x N + b 

With these crossover points determined, the hybrid approach was evaluated on the same support thresholds and the 
speedup for this approach over both PTPE and MapConcatenate is visible in Figure 7(b). The range of improvement over 
PTPE is almost 3X and over 4.5X for MapConcatenate. When the number of episodes is large (i.e., low support threshold) 
there are enough episodes to fully utilize the GPU and as such Hybrid shows little improvement over PTPE. Conversely, 
Hybrid shows little improvement over MapConcatenate when the support threshold is high with very few episodes to 
search for. 

6.3 One-Pass (Al) vs Two-Pass (A2+A1) 

As stated in Section 5.3, the performance of algorithm Al suffers from the requirement of large shared memory and 
large register file, especially when the episode size is big. So we introduce algorithm A2 that can eliminate most of the 
non-supported episodes and requires much less shared memory and register file, then the complex algorithm Al can be 
executed on much fewer number of episodes, resulting in performance gains. The amount of elimination that A2 conducts 
can greatly affect the execution time at different episode sizes. In Figure 9(a), the two-pass algorithm eliminates over 
99.9% (43634 out of 43656) of the episodes of size four The end result is a speedup of 3.6X over the hybrid Al algorithm 
for this episode size and an overall speedup for this support threshold of 2.53X. Speedups for three different datasets at 
different support thresholds are shown in Figure 9(b) where in every case, the two-pass approach outperforms the one-pass 
approach with speedups ranging from 1.2X to 2.8X. 

We also use CUDA Visual Profiler to analyze the execution of algorithm Al and A2 to give a quantitative measurement 
of how A2 out-performs Al on the GPU. We have analyzed various GPU performance factors, such as GPU occupancy, 
coalesced global memory access, shared memory bank conflict, divergent branching, and local memory loads and stores. 
We find the last two factors are primarily attributed to the performance difference between Al and A2, which are shown in 
Figure 10. Algorithm Al requires 17 registers and 80 bytes of local memory for each counting thread, while algorithm A2 
only requires 13 registers and no local memory. Since local memory is used as supplement for registers and mapped onto 
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Execution Time on Support 3600 




Episode Size 

(a) Execution time of Two-Pass and One-Pass algoritlims for Support=3600 on Dataset 2-1-35 at dif- 
ferent episode sizes. 
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(b) Speedup of Two-Pass algorithm over One-Pass for multiple support thresholds on multiple datasets. 



Figure 9: Execution time and speedup comparison of One-Pass algorithm versus Two-Pass algorithm. 



global memory space, it is accessed very frequently and has the same high memory latency as global memory. In Figure 10 
(a), the total amount of local memory access of both two-pass approach and one-pass approach comes from algorithm Al. 
Since algorithm A2 eliminates most of the non-supported episodes and requires no local memory access, the local memory 
access of two-pass approach is much less than one-pass approach when the size of episode increases. At the size of 4, A2 
eliminates all episode candidates, thus there is no Al execution and no local memory access, resulting a large performance 
gain for two-pass approach over one-pass. As shown in Figure 10 (b), the amount of divergent branching also affects the 
GPU performance difference between the two-pass approach and the one-pass approach. 




Size 2 Sjze i Size 4 Size 2 Size 3 Size 4 



(a) (b) 

Figure 10: Comparison between algorithm Al and algorithm A2 for support threshold 1650 on dataset 2-1-33. (a) Total 
number of loads and stores of local memory, (b) Total number of divergent branches. 
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6.4 GPU vs. CPU 

To demonstrate the performance gain of our GPU approach, we implemented Algorithm 1 for a quad-core CPU with four 
threads, and tested on our testbed machine. The CPU implementation is written in C++ and optimized for sequentially 
executed applications. For example, since each CPU thread counts a large number of episodes, we can read the event 
stream exactly once for each thread, and update all state machines in that thread with each event. In addition, we used 
an acceleration structure to speed up the search for which the state machine needs to be updated. In addition, we also 
implement two-pass (A2 + Al) elimination approach on CPU to get a fair performance comparison against our GPU 
implementation. 

Compared with the performance of the CPU implementation, our GPU algorithms exhibit a significant speedup, as 
shown in Figure 11. For the 2-1-35 dataset, the speedup is approximately 15-fold for a support threshold of 2700. 

6.5 Mining Evolving Neuronal Circuits 

Due to space limitations, we are unable to discuss the biological significance of our mined episodes. In addition to under- 
standing the activity of an ensemble of neurons, one of the advantages of our GPGPU solution is to mine evolving cul- 
tures, and watch the progression of neuronal development in real-time. Please see our supplemental website http://neural- 
code.cs.vt.edu/gpgpu for videos depicting how our mining reveals key characteristics of cortical culture growth. 
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Figure 1 1 : Speedup of GPU implementation over the CPU implementation 



7 Discussion 

We have presented a powerful and non-trivial framework for conducting frequent episode mining on GPGPUs and shown 
its capabilities for mining neuronal circuits in spike train datasets. For the first time, neuroscientists can enjoy the benefits 
of data mining algorithms without needing access to costly and specialized clusters of workstations. Our supplementary 
website (http://neural-code.cs.vt.edu/gpgpu) provides auxiliary plots and videos demonstrating how we can track evolving 
cultures to reveal the progression of neural development in real-time. 
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Our future work is in four categories. First, our experiences with the neuroscience apphcation have opened up the 
interesting topic of mapping finite state machine based algorithms onto GPGPUs. A general framework to map any 
finite state machine algorithm for counting will be extremely powerful not just for neuroscience but for many other areas 
such as (massive) sequence analysis in bioinformatics and linguistics. Second, the development of MapConcatenate 
highlights the importance of developing new, additional, programming abstractions specifically geared toward data mining 
on GPGPUs. Third, we found that the two-pass approach performs significantly better than running the complex counting 
algorithm on the entire input. The first pass solves a simpler problem, and generates an upper bound of the solution to the 
original problem. This significantly reduces the input size for the complex second pass, speeding up the entire process. 
Finally, we wish to integrate more aspects of the apphcation context into our algorithmic pipehne, such as candidate 
generation, streaming analysis, and rapid "fast-forward" and "slow-play" facihties for visualizing the development of 
neuronal circuits. 
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